FXN: Save graphical outputs
picsave <- function(graph, name) {
  ggsave(plot = graph, filename = name, device = "pdf", width = 12, height = 8, path = "~/GitHub/S_Lipkind_Rundergrad2020/week3/pics/")
}
my_variable <- 10
my_varıable
## [1] 10
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found
Was this written on a Turkish keyboard or something? The second name has a dotless ı where the i should be, so R treats it as a different object and cannot find it.
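A quick way to convince myself the two letters really differ is to compare their code points. This is only a sketch; `dotless` and `typo_name` are names I made up for the check.

```r
# The dotless letter from the misspelled name, written as a Unicode escape.
dotless <- "\u0131"

utf8ToInt("i")      # 105
utf8ToInt(dotless)  # 305

# Rebuild the misspelled name and compare it with the real one.
typo_name <- paste0("my_var", dotless, "able")
identical(typo_name, "my_variable")  # FALSE
```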
a <- ggplot(data = mpg) + #dota -> data
geom_point(mapping = aes(x = displ, y = hwy))
#picsave(a, "4.4.2 graph.pdf")
filter(mpg, cyl == 8) #= -> ==
filter(diamonds, carat > 3) # diamond -> diamonds
Oh my goodness, that is amazing. An entire list of keyboard shortcuts. You could also reach that page by going to Help > Keyboard Shortcuts Help.
Find all flights that…
#### 1.1: had an arrival delay of two or more hours
head(flights)
filter(flights, arr_delay >= 120) # arr_delay is in minutes, so two hours is 120
#### 1.4: Departed in summer (July, August, and September)
filter(flights, month %in% c(7,8,9))
#### 1.5: Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120 & dep_delay <= 0) # delays are in minutes; early departures (negative dep_delay) also count as not late
#### 1.7: Departed between midnight and 6am (inclusive)
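(d <- filter(flights, dep_time <= 600 | dep_time == 2400)) # midnight is recorded as 2400, not 0
#### 2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
?between
It’s an inclusive shortcut to find values within a certain range: between(x, left, right) is equivalent to x >= left & x <= right.
e <- filter(flights, between(dep_time, 0, 600) | dep_time == 2400) # between() already returns a logical vector, so no %in% is needed
# d and e contain the same rows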
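To check that between() really is inclusive at both ends, here is a small sketch on a made-up vector of HHMM times (the values in `dep` are hypothetical, not taken from flights).

```r
library(dplyr)

# Hypothetical departure times in HHMM format, including a missing value.
dep <- c(1, 559, 600, 601, 2400, NA)

# between() should match the spelled-out comparison element by element.
with_between <- between(dep, 0, 600)
spelled_out <- dep >= 0 & dep <= 600

identical(with_between, spelled_out)
```

Both versions keep 600 and return NA for the missing time, which filter() then drops.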
#### 3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time)) # 8255 rows/flights
#these observations also all have NA dep_delay, arr_time, arr_delay.
#Hypothesis: cancelled flights
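Rather than eyeballing which other variables are missing, the missing values can be counted per column. A minimal sketch, assuming dplyr and nycflights13 are installed; `no_dep` is just a name I picked.

```r
library(dplyr)
library(nycflights13)

# Rows with no recorded departure time.
no_dep <- filter(flights, is.na(dep_time))
nrow(no_dep)

# Number of missing values in each column, within those rows only.
colSums(is.na(no_dep))
```

Any column whose count equals nrow(no_dep) is missing for every one of these flights.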
arrange(flights, desc(is.na(dep_delay))) #this works, but is it the intended solution?
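Here is the same trick on a tiny made-up tibble, where it is easier to see what happens. The column `delay` and its values are hypothetical; I added a second sort key so the non-missing rows are still ordered.

```r
library(dplyr)

# Hypothetical delays with two missing values.
toy <- tibble(delay = c(5, NA, -2, NA, 30))

# is.na() is TRUE for missing rows, and desc() puts TRUE before FALSE,
# so the NAs come first; the second key sorts the remaining rows.
sorted <- arrange(toy, desc(is.na(delay)), delay)
sorted
```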
arrange(flights, dep_time, desc(dep_delay))
y <- arrange(flights, desc(distance), air_time)
#(select(y, distance, air_time)) -> double-checking
(longest_distance <- top_n(flights, 10, distance))
(shortest_distance <- top_n(flights, -10, distance))
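As far as I can tell, newer dplyr versions (1.0.0 and later) mark top_n() as superseded in favour of slice_max() and slice_min(). A sketch of the equivalent calls, assuming a recent dplyr; the `_alt` names are mine.

```r
library(dplyr)
library(nycflights13)

# Ten longest flights; ties are kept by default, so more than 10 rows can come back.
longest_distance_alt <- slice_max(flights, distance, n = 10)

# Ten shortest flights, with the same tie behaviour.
shortest_distance_alt <- slice_min(flights, distance, n = 10)
```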
#one option: select()
select(flights, day, month, day, month, dep_delay, dep_delay)
Repeating a variable name does not appear to make a difference: each column shows up once, in the position where its name first appears in the list.
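Looking only at the column names makes this easy to confirm. A minimal sketch; `picked` is a name I made up.

```r
library(dplyr)
library(nycflights13)

# Each repeated name is kept once, at the position of its first mention.
picked <- names(select(flights, day, month, day, month, dep_delay, dep_delay))
picked  # "day" "month" "dep_delay"
```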
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
one_of() takes a character vector of column names and selects the columns whose names appear in it.
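From what I understand, recent tidyselect versions list one_of() as superseded by any_of() and all_of(). The sketch below shows how the two differ; "not_a_column" is a deliberately bogus name that I added for illustration.

```r
library(dplyr)
library(nycflights13)

vars_extra <- c("year", "month", "day", "dep_delay", "arr_delay", "not_a_column")

# any_of() silently skips names that are not columns of flights.
lenient <- select(flights, any_of(vars_extra))
names(lenient)

# all_of() is strict: it raises an error for the bogus name.
strict <- try(select(flights, all_of(vars_extra)), silent = TRUE)
inherits(strict, "try-error")
```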
select(flights, contains("TIME"))
The results aren’t too surprising, though I didn’t realize that contains() ignores case by default. If I wanted the match to be case-sensitive, I could set ignore.case = FALSE:
select(flights, contains("time", ignore.case = FALSE))
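Since every column in flights is lower case, the difference is easier to see on a made-up tibble with mixed-case names (the tibble `toy_cols` and its columns are hypothetical).

```r
library(dplyr)

# Hypothetical columns, one lower case and one upper case.
toy_cols <- tibble(dep_time = 1, AIR_TIME = 2, carrier = "UA")

# Default: case is ignored, so both time columns match.
loose <- names(select(toy_cols, contains("TIME")))

# Case-sensitive: only the upper-case column matches.
exact <- names(select(toy_cols, contains("TIME", ignore.case = FALSE)))
```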
head(flights)
minutes <- function(miltime) {
  miltime %/% 100 * 60 + miltime %% 100
}
mod <- flights %>%
  select(dep_time, sched_dep_time) %>%
  lapply(., minutes) %>%
  as_tibble() %>%
  rename(
    mod_dep_time = dep_time,
    mod_sched_dep_time = sched_dep_time)
modflights <- flights %>% #I initially tried to join flights and mod together, but I couldn't figure out the 'by' part.
  mutate(
    mod_dep_time = mod$mod_dep_time,
    mod_sched_dep_time = mod$mod_sched_dep_time)
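In hindsight no join should be needed, because minutes() is vectorised and can be called inside mutate() directly. A minimal sketch of that shorter route; `modflights2` is a name I made up so it does not overwrite the version above.

```r
library(dplyr)
library(nycflights13)

# Convert an HHMM clock value to minutes since midnight.
minutes <- function(miltime) {
  miltime %/% 100 * 60 + miltime %% 100
}

# Add both converted columns in a single mutate() call.
modflights2 <- flights %>%
  mutate(
    mod_dep_time = minutes(dep_time),
    mod_sched_dep_time = minutes(sched_dep_time))
```

One thing to keep in mind: a dep_time of 2400 becomes 1440 rather than 0 with this formula.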
discrepancy <- flights %>%
mutate(
arr_dep = arr_time - dep_time) %>%
select(arr_dep, air_time) %>%
lapply(.,minutes) %>%
glimpse(.)
## List of 2
## $ arr_dep : num [1:336776] 193 197 261 300 178 146 238 112 201 155 ...
## $ air_time: num [1:336776] 147 147 120 143 76 110 118 53 100 98 ...
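#Theoretically, you would expect the values to be the same, but they are not.
#Perhaps there's an issue as a result of military time? When I converted to minutes, though, the difference remained. Air time is shorter than arr_dep.
#Caveat: minutes() was also applied to air_time, which is already in minutes, so the air_time values printed above are distorted (the first flight's 227 became 147).
#Caveat: arr_time - dep_time subtracts two HHMM clock values before converting, which goes wrong whenever the subtraction crosses an hour boundary; each time should be converted with minutes() first.
#Hypothesis: The difference in time is explained by the time the planes take to embark and disembark. Time zones (arr_time is local to the destination) and landings after midnight probably contribute as well.
#How to test: add dep_delay and arr_delay to air_time, see if air_time == arr_dep.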
head(flights)
flights %>%
mutate(
air_time_delay = air_time + arr_delay + dep_delay,
arr_dep = discrepancy$arr_dep) %>%
select(air_time,air_time_delay,arr_dep, dep_delay, arr_delay) %>%
lapply(., minutes) %>%
glimpse(.)
## List of 5
## $ air_time : num [1:336776] 147 147 120 143 76 110 118 53 100 98 ...
## $ air_time_delay: num [1:336776] 160 171 155 124 85 118 132 36 89 104 ...
## $ arr_dep : num [1:336776] 153 157 181 180 138 106 158 72 121 115 ...
## $ dep_delay : num [1:336776] 2 4 2 39 34 36 35 37 37 38 ...
## $ arr_delay : num [1:336776] 11 20 33 22 15 12 19 26 32 8 ...
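#However, that doesn't seem to work, either, and the difference between arr_dep and air_time is not explained neatly by either dep_delay or arr_delay.
#Caveat: lapply(., minutes) here re-converts columns that are already in minutes (air_time, arr_dep, and both delays), so the printed values are not reliable. Negative delays in particular come out positive: a dep_delay of -1 prints as 39.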
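A cleaner version of this comparison would convert the two clock times first and leave air_time alone, since it is already in minutes. The sketch below does that; the column names `dep_min`, `arr_min` and `gate_to_gate` are mine, and the wrap past midnight assumes no flight spans more than one midnight. It still ignores time zones, so I would not expect an exact match with air_time.

```r
library(dplyr)
library(nycflights13)

# Convert an HHMM clock value to minutes since midnight.
minutes <- function(miltime) {
  miltime %/% 100 * 60 + miltime %% 100
}

check <- flights %>%
  mutate(
    dep_min = minutes(dep_time),
    arr_min = minutes(arr_time),
    # Elapsed clock time; %% 1440 adds a day when the flight lands after midnight.
    gate_to_gate = (arr_min - dep_min) %% 1440) %>%
  select(dep_time, arr_time, gate_to_gate, air_time)

head(check)
```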
#solution 1: top_n()
flights %>%
mutate(
delays4dayz = dep_delay + arr_delay) %>%
top_n(.,10, delays4dayz) %>%
glimpse(.)
## Observations: 10
## Variables: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month <int> 1, 1, 12, 3, 4, 5, 6, 7, 7, 9
## $ day <int> 9, 10, 5, 17, 10, 3, 15, 22, 22, 20
## $ dep_time <int> 641, 1121, 756, 2321, 1100, 1133, 1432, 845, 2257, 1139
## $ sched_dep_time <int> 900, 1635, 1700, 810, 1900, 2055, 1935, 1600, 759, 1845
## $ dep_delay <dbl> 1301, 1126, 896, 911, 960, 878, 1137, 1005, 898, 1014
## $ arr_time <int> 1242, 1239, 1058, 135, 1342, 1250, 1607, 1044, 121, ...
## $ sched_arr_time <int> 1530, 1810, 2020, 1020, 2211, 2215, 2120, 1815, 1026...
## $ arr_delay <dbl> 1272, 1109, 878, 915, 931, 875, 1127, 989, 895, 1007
## $ carrier <chr> "HA", "MQ", "AA", "DL", "DL", "MQ", "MQ", "MQ", "DL"...
## $ flight <int> 51, 3695, 172, 2119, 2391, 3744, 3535, 3075, 2047, 177
## $ tailnum <chr> "N384HA", "N517MQ", "N5DMAA", "N927DA", "N959DL", "N...
## $ origin <chr> "JFK", "EWR", "EWR", "LGA", "JFK", "EWR", "JFK", "JF...
## $ dest <chr> "HNL", "ORD", "MIA", "MSP", "TPA", "ORD", "CMH", "CV...
## $ air_time <dbl> 640, 111, 149, 167, 139, 112, 74, 96, 109, 354
## $ distance <dbl> 4983, 719, 1085, 1020, 1005, 719, 483, 589, 762, 2586
## $ hour <dbl> 9, 16, 17, 8, 19, 20, 19, 16, 7, 18
## $ minute <dbl> 0, 35, 0, 10, 0, 55, 35, 0, 59, 45
## $ time_hour <dttm> 2013-01-09 09:00:00, 2013-01-10 16:00:00, 2013-12-0...
## $ delays4dayz <dbl> 2573, 2235, 1774, 1826, 1891, 1753, 2264, 1994, 1793...
#If one were to instead use min_rank(), tied values would all receive the same rank, namely the smallest rank in the tied group, and the ranks after the tie would be skipped (as in sports rankings).
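A small made-up vector shows the tie handling more clearly than the documentation did for me. The values in `delays` are hypothetical.

```r
library(dplyr)

# Hypothetical delays with one tie.
delays <- c(10, 20, 20, 30)

min_rank(delays)        # 1 2 2 4: both 20s share rank 2, rank 3 is skipped
min_rank(desc(delays))  # 4 2 2 1: same idea, ranking from the largest
dense_rank(delays)      # 1 2 2 3: ties share a rank, nothing is skipped
row_number(delays)      # 1 2 3 4: ties are broken by position
```

So filtering on min_rank(desc(x)) <= 10 can return more than 10 rows when there is a tie at the cutoff.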
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
#Returns: 2 4 6 5 7 9 8 10 12 11
Because 1:3 is shorter than 1:10, R recycles 1:3 as it adds along 1:10, and it warns because 10 is not a multiple of 3. In effect it computes:
(1 2 3 1 2 3 1 2 3 1) + (1 2 3 4 5 6 7 8 9 10)
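The recycling can be made explicit with rep_len(), which stretches the shorter vector to a given length. A minimal sketch in base R; `recycled` is a name I picked.

```r
# Stretch 1:3 to length 10 by repeating it from the start.
recycled <- rep_len(1:3, length.out = 10)
recycled         # 1 2 3 1 2 3 1 2 3 1

# Adding the stretched vector gives the same result, without the warning.
recycled + 1:10  # 2 4 6 5 7 9 8 10 12 11
```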
sin(pi)
## [1] 1.224606e-16
cos(pi)
## [1] -1
tan(pi)
## [1] -1.224647e-16
#sec(pi) Not included
#csc(pi) Not included
#cot(pi) Not included
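The missing reciprocal functions are easy enough to define by hand. This is only a sketch, and the names sec, csc and cot are my own definitions, not part of base R.

```r
# Reciprocal trigonometric functions built from the ones base R provides.
sec <- function(x) 1 / cos(x)
csc <- function(x) 1 / sin(x)
cot <- function(x) 1 / tan(x)

sec(pi)  # -1

# sin(pi) is a tiny non-zero number in floating point, so csc(pi) and cot(pi)
# come out as very large finite values instead of being undefined.
csc(pi)
cot(pi)
```

Base R also has sinpi(), cospi() and tanpi(), which take the angle as a multiple of pi; sinpi(1) returns exactly 0.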